Abstract. Parsing Expression Grammars (PEGs) are a formalism that
can describe all deterministic context-free languages through a set of 
rules that specify a top-down parser for some language. PEGs are easy 
to use, and there are efficient implementations of PEG libraries in several 
programming languages. 

A frequently missed feature of PEGs is left recursion, which is commonly 
used in Context-Free Grammars (CFGs) to encode left-associative op- 
erations. We present a simple conservative extension to the semantics 
of PEGs that gives useful meaning to direct and indirect left-recursive 
rules, and show that our extensions make it easy to express left-recursive 
idioms from CFGs in PEGs, with similar results. We prove the conser- 
vativeness of these extensions, and also prove that they work with any 
left-recursive PEG. 

Keywords: parsing expression grammars, parsing, left recursion, natu- 
ral semantics, packrat parsing 

1 Introduction 

Parsing Expression Grammars (PEGs) [3] are a formalism for describing a lan- 
guage's syntax, and an alternative to the commonly used Context Free Gram- 
mars (CFGs). Unlike CFGs, PEGs are unambiguous by construction, and their 
standard semantics is based on recognizing instead of deriving strings. Further- 
more, a PEG can be considered both the specification of a language and the 
specification of a top-down parser for that language.

PEGs use the notion of limited backtracking: the parser, when faced with 
several alternatives, tries them in a deterministic order (left to right), discarding 
remaining alternatives after one of them succeeds. They also have an expressive 
syntax, based on the syntax of extended regular expressions, and syntactic
predicates, a form of unrestricted lookahead where the parser checks whether the
rest of the input matches a parsing expression without consuming the input.

The top-down parsing approach of PEGs means that they cannot handle left 
recursion in grammar rules, as they would make the parser loop forever. Left 
recursion can be detected structurally, so PEGs with left-recursive rules can be 



simply rejected by PEG implementations instead of leading to parsers that do 
not terminate, but the lack of support for left recursion is a restriction on the 
expressiveness of PEGs. The use of left recursion is a common idiom for express- 
ing language constructs in a grammar, and is present in published grammars for 
programming languages; the use of left recursion can make rewriting an existing 
grammar as a PEG a difficult task [TT . 

There are proposals for adding support for left recursion to PEGs, but they
either assume a particular PEG implementation approach, packrat parsing [23],
or support just direct left recursion '7D.. Packrat parsing [2] is an optimization
of PEGs that uses memoization to guarantee linear time behavior in the presence
of backtracking and syntactic predicates, but can be slower in practice [18,14].
Packrat parsing is a common implementation approach for PEGs, but there are
others [12]. Indirect left recursion is present in real grammars, and is difficult
to untangle [T7].

In this paper, we present a novel operational semantics for PEGs that gives a 
well-defined and useful meaning for PEGs with left-recursive rules. The seman- 
tics is given as a conservative extension of the existing semantics, so PEGs that 
do not have left-recursive rules continue having the same meaning as they had. It 
is also implementation agnostic, and should be easily implementable on packrat 
implementations, plain recursive descent implementations, and implementations 
based on a parsing machine. 

We also introduce parse strings as a possible semantic value resulting from a 
PEG parsing some input, in parallel to the parse trees of context-free grammars. 
We show that the parse strings that left-recursive PEGs yield for the common 
left-recursive grammar idioms are similar to the parse trees we get from bottom- 
up parsers and left-recursive CFGs, so the use of left-recursive rules in PEGs 
with our semantics should be intuitive for grammar writers.

The rest of this paper is organized as follows: Section 2 presents a brief
introduction to PEGs and discusses the problem of left recursion in PEGs;
Section 3 presents our semantic extensions for PEGs with left-recursive rules;
Section 4 reviews some related work on PEGs and left recursion in more detail;
finally, Section 5 presents our concluding remarks.

2 Parsing Expression Grammars and Left Recursion 

Parsing Expression Grammars borrow the use of non-terminals and rules (or 
productions) to express context-free recursion, although all non-terminals in a 
PEG must have only one rule. The syntax of the right side of the rules, the 
parsing expressions, is borrowed from regular expressions and its extensions, 
in order to make it easier to build parsers that parse directly from characters 
instead of tokens from a previous lexical analysis step. The semantics of PEGs 
come from backtracking top-down parsers, but in PEGs the backtracking is local 
to each choice point. 

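Our presentation of PEGs is slightly different from Ford's [3], and comes from
earlier work [12,13]. This style makes the exposition of our extensions, and their
behavior, easier to understand. We define a PEG $G$ as a tuple $(V, T, P, p_S)$ where
$V$ is the finite set of non-terminals, $T$ is the alphabet (finite set of terminals),
$P$ is a function from $V$ to parsing expressions, and $p_S$ is the starting expression,
the one that the PEG matches. Function $P$ is commonly described through a
set of rules of the form $A \leftarrow p$, where $A \in V$ and $p$ is a parsing expression.

Empty String

$$\frac{}{G[\varepsilon]\; x \stackrel{\mathrm{PEG}}{\leadsto} (x, \varepsilon)} \quad \text{(empty.1)}$$

Terminal

$$\frac{}{G[a]\; ax \stackrel{\mathrm{PEG}}{\leadsto} (x, a)} \quad \text{(char.1)}$$

$$\frac{}{G[b]\; ax \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}},\; b \neq a \quad \text{(char.2)}$$

$$\frac{}{G[a]\; \varepsilon \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(char.3)}$$

Variable

$$\frac{G[P(A)]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')}{G[A]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, A[x'])} \quad \text{(var.1)}$$

$$\frac{G[P(A)]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[A]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(var.2)}$$

Concatenation

$$\frac{G[p_1]\; xyz \stackrel{\mathrm{PEG}}{\leadsto} (yz, x') \quad G[p_2]\; yz \stackrel{\mathrm{PEG}}{\leadsto} (z, y')}{G[p_1 p_2]\; xyz \stackrel{\mathrm{PEG}}{\leadsto} (z, x'y')} \quad \text{(con.1)}$$

$$\frac{G[p_1]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x') \quad G[p_2]\; y \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[p_1 p_2]\; xy \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(con.2)}$$

$$\frac{G[p_1]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[p_1 p_2]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(con.3)}$$

Choice

$$\frac{G[p_1]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')}{G[p_1 / p_2]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')} \quad \text{(ord.1)}$$

$$\frac{G[p_1]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail} \quad G[p_2]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[p_1 / p_2]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(ord.2)}$$

$$\frac{G[p_1]\; xy \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail} \quad G[p_2]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')}{G[p_1 / p_2]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')} \quad \text{(ord.3)}$$

Not Predicate

$$\frac{G[p]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[!p]\; x \stackrel{\mathrm{PEG}}{\leadsto} (x, \varepsilon)} \quad \text{(not.1)}$$

$$\frac{G[p]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')}{G[!p]\; xy \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(not.2)}$$

Repetition

$$\frac{G[p]\; x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[p^*]\; x \stackrel{\mathrm{PEG}}{\leadsto} (x, \varepsilon)} \quad \text{(rep.1)}$$

$$\frac{G[p]\; xyz \stackrel{\mathrm{PEG}}{\leadsto} (yz, x') \quad G[p^*]\; yz \stackrel{\mathrm{PEG}}{\leadsto} (z, y')}{G[p^*]\; xyz \stackrel{\mathrm{PEG}}{\leadsto} (z, x'y')} \quad \text{(rep.2)}$$

Fig. 1. Semantics of the $\stackrel{\mathrm{PEG}}{\leadsto}$ relation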

Parsing expressions are the core of our formalism, and they are defined
inductively as the empty expression $\varepsilon$, a terminal symbol $a$, a non-terminal
symbol $A$, a concatenation $p_1 p_2$ of two parsing expressions $p_1$ and $p_2$, an
ordered choice $p_1 / p_2$ between two parsing expressions $p_1$ and $p_2$, a repetition
$p^*$ of a parsing expression $p$, or a not-predicate $!p$ of a parsing expression $p$.
We leave out extensions such as the dot, character classes, strings, and the
and-predicate, as their addition is straightforward.



We define the semantics of PEGs via a relation $\stackrel{\mathrm{PEG}}{\leadsto}$ among a PEG, a parsing
expression, a subject, and a result. The notation $G[p]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')$ means that
the expression $p$ matches the input $xy$, consuming the prefix $x$, while leaving $y$
and yielding a parse string $x'$ as the output, while resolving any non-terminals
using the rules of $G$. We use $G[p]\; xy \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$ to express an unsuccessful match.
The language of a PEG $G$ is defined as all strings that $G$'s starting expression
consumes, that is, the set $\{x \in T^* \mid G[p_S]\; xy \stackrel{\mathrm{PEG}}{\leadsto} (y, x')\}$.

Figure 1 presents the definition of $\stackrel{\mathrm{PEG}}{\leadsto}$ using natural semantics [11,25], as a set
of inference rules. Intuitively, $\varepsilon$ just succeeds and leaves the subject unaffected;
$a$ matches and consumes itself, or fails; $A$ tries to match the expression $P(A)$;
$p_1 p_2$ tries to match $p_1$, and if it succeeds tries to match $p_2$ on the part of the
subject that $p_1$ did not consume; $p_1 / p_2$ tries to match $p_1$, and if it fails tries to
match $p_2$; $p^*$ repeatedly tries to match $p$ until it fails, thus consuming as much
of the subject as it can; finally, $!p$ tries to match $p$ and fails if $p$ succeeds and
succeeds if $p$ fails, in any case leaving the subject unaffected. It is easy to see
that the result of a match is either failure or a suffix of the subject (not
necessarily a proper suffix, as the expression may succeed without consuming
anything).
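
The rules of Figure 1 translate almost line by line into a recursive function. The following is a minimal Python sketch, not the authors' implementation; the tagged-tuple encoding of expressions, the names `match`, `FAIL` and `G`, and the guard against non-consuming repetitions are assumptions made for illustration. It returns the unconsumed suffix together with the parse string, or `FAIL`.

```python
# Hypothetical encoding (an assumption, not from the paper): tagged tuples.
#   ("eps",)  ("chr", a)  ("var", A)  ("seq", p1, p2)
#   ("alt", p1, p2)  ("not", p)  ("star", p)
FAIL = None  # stands for the result `fail`


def match(G, p, s):
    """Return (unconsumed suffix, parse string) or FAIL, following Figure 1."""
    tag = p[0]
    if tag == "eps":                                   # empty.1
        return s, ""
    if tag == "chr":                                   # char.1 / char.2 / char.3
        return (s[1:], p[1]) if s[:1] == p[1] else FAIL
    if tag == "var":                                   # var.1 / var.2
        r = match(G, G[p[1]], s)
        return FAIL if r is FAIL else (r[0], f"{p[1]}[{r[1]}]")
    if tag == "seq":                                   # con.1 / con.2 / con.3
        r1 = match(G, p[1], s)
        if r1 is FAIL:
            return FAIL
        r2 = match(G, p[2], r1[0])
        return FAIL if r2 is FAIL else (r2[0], r1[1] + r2[1])
    if tag == "alt":                                   # ord.1 / ord.2 / ord.3
        r1 = match(G, p[1], s)
        return r1 if r1 is not FAIL else match(G, p[2], s)
    if tag == "not":                                   # not.1 / not.2
        return (s, "") if match(G, p[1], s) is FAIL else FAIL
    if tag == "star":                                  # rep.1 / rep.2
        out = ""
        while True:
            r = match(G, p[1], s)
            if r is FAIL:
                return s, out
            if r[0] == s:  # no proof tree exists here; stop instead of looping
                raise ValueError("repetition of a non-consuming expression")
            s, out = r[0], out + r[1]
    raise ValueError(f"unknown expression {p!r}")


def seq(*ps):
    """Right-nested concatenation of several expressions."""
    return ps[0] if len(ps) == 1 else ("seq", ps[0], seq(*ps[1:]))


# E <- T E'*     T <- n / (E)     E' <- +T / -T
G = {
    "E": seq(("var", "T"), ("star", ("var", "E'"))),
    "T": ("alt", ("chr", "n"), seq(("chr", "("), ("var", "E"), ("chr", ")"))),
    "E'": ("alt", seq(("chr", "+"), ("var", "T")), seq(("chr", "-"), ("var", "T"))),
}

print(match(G, ("var", "E"), "n+n-n"))
```

The sketch only covers grammars without left recursion; on a left-recursive rule the `var` case would recurse forever, which is the problem the rest of the paper addresses.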

Context-Free Grammars have the notion of a parse tree, a graphical represen- 
tation of the structure that a valid subject has, according to the grammar. The 
proof trees of our semantics can have a similar role, but they have extra infor- 
mation that can obscure the desired structure. This problem will be exacerbated 
in the proof trees that our rules for left-recursion yield, and is the reason we in- 
troduce parse strings to our formalism. A parse string is roughly a linearization 
of a parse tree, and shows which non-terminals have been used in the process 
of matching a given subject. Having the result of a parse be an actual tree and 
having arbitrary semantic actions are straightforward extensions. 

When using PEGs for parsing it is important to guarantee that a given 
grammar will either yield a successful result or fail for every subject, so parsing 
always terminates. Grammars where this is true are complete [3]. In order to 
guarantee completeness, it is sufficient to check for the absence of direct or 
indirect left recursion, a property that can be checked structurally using the 
well-formed predicate from Ford [3] (abbreviated WF).

Inductively, empty expressions and symbol expressions are always well-formed; 
a non-terminal is well-formed if it has a production and it is well-formed; a choice 
is well-formed if the alternatives are well-formed; a not predicate is well-formed if 
the expression it uses is well-formed; a repetition is well-formed if the expression 
it repeats is well-formed and cannot succeed without consuming input; finally, a 
concatenation is well-formed if either its first expression is well-formed and can- 
not succeed without consuming input or both of its expressions are well-formed. 

A grammar is well-formed if its non-terminals and starting expression are 
all well-formed. The test of whether an expression cannot succeed while not 
consuming input is also computable from the structure of the expression and 
its grammar from an inductive definition [5]. The rule for well-formedness of 
repetitions just derives from writing a repetition p* as a recursion A ← pA / ε,
so a non-well-formed repetition is just a special case of a left-recursive rule. 
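
The structural check described above can be sketched as two least-fixed-point computations. This is an illustrative Python sketch under stated assumptions, not Ford's exact definition: expressions use a hypothetical tagged-tuple encoding, and `nullable` is a conservative approximation of "can succeed without consuming input" (it treats every predicate and repetition as possibly non-consuming).

```python
# Hypothetical encoding (an assumption, not from the paper): tagged tuples.
#   ("eps",)  ("chr", a)  ("var", A)  ("seq", p1, p2)
#   ("alt", p1, p2)  ("not", p)  ("star", p)

def nullable(p, nul):
    """Conservative test: may p succeed without consuming input?"""
    tag = p[0]
    if tag in ("eps", "not", "star"):
        return True
    if tag == "chr":
        return False
    if tag == "var":
        return p[1] in nul
    if tag == "seq":
        return nullable(p[1], nul) and nullable(p[2], nul)
    return nullable(p[1], nul) or nullable(p[2], nul)      # "alt"


def wf(p, ok, nul):
    """Well-formedness of p, given the non-terminals already known to be WF."""
    tag = p[0]
    if tag in ("eps", "chr"):
        return True
    if tag == "var":
        return p[1] in ok
    if tag == "not":
        return wf(p[1], ok, nul)
    if tag == "star":
        return wf(p[1], ok, nul) and not nullable(p[1], nul)
    if tag == "alt":
        return wf(p[1], ok, nul) and wf(p[2], ok, nul)
    # "seq": the second part only matters if the first may consume nothing
    return wf(p[1], ok, nul) and (not nullable(p[1], nul) or wf(p[2], ok, nul))


def least_fixpoint(G, holds):
    """Grow the set of non-terminals whose rule satisfies `holds`."""
    found, changed = set(), True
    while changed:
        changed = False
        for A, body in G.items():
            if A not in found and holds(body, found):
                found.add(A)
                changed = True
    return found


def well_formed(G, start):
    nul = least_fixpoint(G, nullable)
    ok = least_fixpoint(G, lambda p, found: wf(p, found, nul))
    return len(ok) == len(G) and wf(start, ok, nul)


# E <- E + n / n is left-recursive, so it is rejected.
left_rec = {"E": ("alt", ("seq", ("var", "E"), ("seq", ("chr", "+"), ("chr", "n"))),
                  ("chr", "n"))}
print(well_formed(left_rec, ("var", "E")))   # False
```

Because a non-terminal only enters the `ok` set once its rule is well-formed with respect to the non-terminals already in the set, a left-recursive rule can never justify itself and is reported as not well-formed.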



Left recursion is not a problem in the popular bottom-up parsing approaches, 
and is a natural way to express several common parsing idioms. Expressing rep- 
etition using left recursion in a CFG yields a left-associative parse tree, which 
is often desirable when parsing programming languages, either because oper- 
ations have to be left-associative or because left-associativity is more efficient 
in bottom-up parsers [B]. For example, the following is a simple left-associative
CFG for additive expressions, written in EBNF notation: 

E → E + T | E - T | T
T → n | (E)

Rewriting the above grammar as a PEG, by replacing | with the ordered 
choice operator, yields a non-well-formed PEG that does not have a proof tree 
for any subject. We can rewrite the grammar to eliminate the left recursion, 
giving the following CFG, again in EBNF (the curly brackets are metasymbols 
of EBNF notation, and express zero-or-more repetition, while the parentheses
are terminals): 

E → T {E'}
T → n | (E)
E' → +T | -T

This is a simple transformation, but it yields a different parse tree, and ob- 
scures the intentions of the grammar writer, even though it is possible to trans- 
form the parse tree of the non-left-recursive grammar into the left-associative 
parse tree of the left-recursive grammar. But at least we can straightforwardly 
express the non-left-recursive grammar with the following PEG: 

E ← T E'*
T ← n / (E)
E' ← +T / -T

Indirect left recursion is harder to eliminate, and its elimination changes the 
structure of the grammar and the resulting trees even more. For example, the 
following indirectly left-recursive CFG denotes a very simplified grammar for 
l-values in a language with variables, first-class functions, and records (where x
stands for identifiers and n for expressions): 

L → P.x | x
P → P(n) | L

This grammar generates x and x followed by any number of (n) or .x, as long
as it ends with .x. An l-value is a prefix expression followed by a field access, or
a single variable, and a prefix expression is a prefix expression followed by an
operand, denoting a function call, or a valid l-value. In the parse trees for this
grammar each (n) or .x associates to the left.



Writing a PEG that parses the same language is difficult. We can eliminate
the indirect left recursion on L by substitution inside P, getting
P → P(n) | P.x | x, and then eliminate the direct left recursion on P to get the
following CFG:

L → P.x | x
P → x {P'}
P' → (n) | .x

But a direct translation of this CFG to a PEG will not work because PEG
repetition is greedy; the repetition on P' will consume the last .x of the l-value,
and the first alternative of L will always fail. One possible solution is to not use
the P non-terminal in L, and encode l-values directly with the following PEG
(the quoted parentheses are terminals, the unquoted parentheses are metasymbols
of PEGs that mean grouping):

L ← x S*
S ← ( "(" n ")" )* .x

The above uses of left recursion are common in published grammars, with 
more complex versions (involving more rules and a deeper level of indirection) 
appearing in the grammars in the specifications of Java [S] and Lua [10] . Having 
a straightforward way of expressing these in a PEG would make the process of 
translating a grammar specification from an EBNF CFG to a PEG easier and 
less error-prone. 

In the next section we will propose a semantic extension to the PEG formalism
that will give meaningful proof trees to left-recursive grammars. In particular,
we want to have the straightforward translation of common left-recursive
idioms such as left-associative expressions to yield parse strings that are similar
in structure to parse trees of the original CFGs.



3 Bounded Left Recursion 



Intuitively, bounded left recursion is a use of a non-terminal where we limit the
number of left-recursive uses it may have. This is the basis of our extension
for supporting left recursion in PEGs. We use the notation $A^n$ to mean a
non-terminal where we can have less than $n$ left-recursive uses, with $A^0$ being an
expression that always fails. Any left-recursive use of $A^n$ will use $A^{n-1}$, any
left-recursive use of $A^{n-1}$ will use $A^{n-2}$, and so on, with $A^1$ using $A^0$ for any
left-recursive use, so left recursion will fail for $A^1$.
Table 1. Matching E with different bounds



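For the left-recursive definition E ← E + n / n we have the following
progression, where we write expressions equivalent to $E^n$ on the right side:

$$\begin{aligned}
E^0 &\leftarrow \mathtt{fail} \\
E^1 &\leftarrow E^0 + n \,/\, n = \bot + n \,/\, n = n \\
E^2 &\leftarrow E^1 + n \,/\, n = n + n \,/\, n \\
E^3 &\leftarrow E^2 + n \,/\, n = (n + n \,/\, n) + n \,/\, n \\
&\;\;\vdots \\
E^n &\leftarrow E^{n-1} + n \,/\, n
\end{aligned}$$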

It would be natural to expect that increasing the bound will eventually reach
a fixed point with respect to a given subject, but the behavior of the ordered
choice operator breaks this expectation. For example, with a subject n+n and the
previous PEG, $E^2$ will match the whole subject, while $E^3$ will match just the
first n. Table 1 summarizes the results of trying to match some subjects against
E with different left-recursive bounds (they show the suffix that remains, not
the matched prefix).

The fact that increasing the bound can lead to matching a smaller prefix 
means we have to pick the bound carefully if we wish to match as much of the 
subject as possible. Fortunately, it is sufficient to increase the bound until the 
size of the matched prefix stops increasing. In the above example, we would pick 
1 as the bound for n, 2 as the bound for n+n, and 3 as the bound for n+n+n. 
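
The non-monotonic behavior and the stopping criterion described above can be reproduced with a few lines of Python. This is only a sketch specialized to the rule E ← E + n / n; the function names `match_E` and `longest` are hypothetical and chosen for illustration.

```python
def match_E(k, s):
    """Suffix left after matching E^k against s, or None for fail."""
    if k == 0:
        return None                          # E^0 always fails
    rest = match_E(k - 1, s)                 # first alternative: E^(k-1) + n
    if rest is not None and rest.startswith("+n"):
        return rest[2:]
    # second alternative: n
    return s[1:] if s.startswith("n") else None


def longest(s):
    """Increase the bound until the matched prefix stops growing."""
    k, best = 1, match_E(1, s)
    if best is None:
        return None
    while True:
        nxt = match_E(k + 1, s)
        if nxt is None or len(nxt) >= len(best):
            return k, best                   # (chosen bound, remaining suffix)
        k, best = k + 1, nxt


for subject in ("n", "n+n", "n+n+n"):
    row = [match_E(k, subject) for k in range(7)]
    print(subject, row, longest(subject))
```

For the subject n+n the remaining suffix alternates between `+n` and the empty string as the bound grows, which is why the search stops at the first bound that fails to extend the match instead of looking for a fixed point.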

When the bound of a non-terminal $A$ is 1 we are effectively prohibiting a
match via any left-recursive path, as all left-recursive uses of $A$ will fail.
$A^{n+1}$ uses $A^n$ on all its left-recursive paths, so if $A^n$ matches a prefix of length $k$,
$A^{n+1}$ matching a prefix of length $k$ or less means that either there is nothing
to do after matching $A^n$ (the grammar is cyclic), in which case it is pointless to
increase the bound after $A^n$, or all paths starting with $A^n$ failed, and the match
actually used a non-left-recursive path, so $A^{n+1}$ is equivalent to $A^1$. Either
option means that $n$ is the bound that makes $A$ match the longest prefix of the
subject.

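We can easily see this dynamic in the E ← E + n / n example. To match $E^{n+1}$
we have to match $E^n + n \,/\, n$. Assume $E^n$ matches a prefix $x$ of the input.
We then try to match the rest of the input with +n; if this succeeds we will have
matched $x$+n, a prefix bigger than $x$. If this fails we will have matched just n,
which is the same prefix matched by $E^1$.

Left-Recursive Variable

$$\frac{(A, xyz) \notin \mathcal{L} \quad G[P(A)]\; xyz\; \mathcal{L}[(A, xyz) \mapsto \mathtt{fail}] \stackrel{\mathrm{PEG}}{\leadsto} (yz, x') \quad G[P(A)]\; xyz\; \mathcal{L}[(A, xyz) \mapsto (yz, x')] \stackrel{\mathrm{INC}}{\leadsto} (z, (xy)')}{G[A]\; xyz\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} (z, A[(xy)'])} \quad \text{(lvar.1)}$$

$$\frac{(A, x) \notin \mathcal{L} \quad G[P(A)]\; x\; \mathcal{L}[(A, x) \mapsto \mathtt{fail}] \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[A]\; x\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(lvar.2)}$$

$$\frac{\mathcal{L}(A, xy) = \mathtt{fail}}{G[A]\; xy\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}} \quad \text{(lvar.3)}$$

$$\frac{\mathcal{L}(A, xy) = (y, x')}{G[A]\; xy\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} (y, A[x'])} \quad \text{(lvar.4)}$$

Increase Bound

$$\frac{G[P(A)]\; xyzw\; \mathcal{L}[(A, xyzw) \mapsto (yzw, x')] \stackrel{\mathrm{PEG}}{\leadsto} (zw, (xy)') \quad G[P(A)]\; xyzw\; \mathcal{L}[(A, xyzw) \mapsto (zw, (xy)')] \stackrel{\mathrm{INC}}{\leadsto} (w, (xyz)')}{G[P(A)]\; xyzw\; \mathcal{L}[(A, xyzw) \mapsto (yzw, x')] \stackrel{\mathrm{INC}}{\leadsto} (w, (xyz)')},\; \text{where } y \neq \varepsilon \quad \text{(inc.1)}$$

$$\frac{G[P(A)]\; x\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}{G[P(A)]\; x\; \mathcal{L} \stackrel{\mathrm{INC}}{\leadsto} \mathcal{L}(A, x)} \quad \text{(inc.2)}$$

$$\frac{G[P(A)]\; xyz\; \mathcal{L}[(A, xyz) \mapsto (z, (xy)')] \stackrel{\mathrm{PEG}}{\leadsto} (yz, x')}{G[P(A)]\; xyz\; \mathcal{L}[(A, xyz) \mapsto (z, (xy)')] \stackrel{\mathrm{INC}}{\leadsto} (z, (xy)')} \quad \text{(inc.3)}$$

Fig. 2. Semantics for PEGs with left-recursive non-terminals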

Indirect, and even mutual, left recursion is not a problem, as the bounds are
on left-recursive uses of a non-terminal, which are a property of the proof tree,
and not of the structure of the PEG. The bounds on two mutually recursive
non-terminals A and B will depend on which non-terminal is being matched
first; if it is A then the bound of A is fixed while varying the bound of B, and
vice versa. A particular case of mutual left recursion is when a non-terminal is
both left and right-recursive, such as E ← E + E / n. In our semantics, $E^n$ will
match $E^{n-1} + E \,/\, n$, where the right-recursive use of E will have its own bound.
Later in this section we will elaborate on the behavior of both kinds of mutual
recursion.

In order to extend the semantics of PEGs with bounded left recursion, we
will show a conservative extension of the rules in Figure 1 with new rules for
left-recursive non-terminals. For non-left-recursive non-terminals we will still use
rules var.1 and var.2, although we will later prove that this is unnecessary, and
the new rules for non-terminals can replace the current ones. The basic idea of
the extension is to use $A^1$ when matching a left-recursive non-terminal $A$ for the
first time, and then try to increase the bound, while using a memoization table
$\mathcal{L}$ to keep the result of the current bound. We use a different relation,
$\stackrel{\mathrm{INC}}{\leadsto}$, with its own inference rules, for this iterative process of increasing the
bound.



Figure 2 presents the new rules. We give the behavior of the memoization
table $\mathcal{L}$ in the usual substitution style, where $\mathcal{L}[(A, x) \mapsto X](B, y) = \mathcal{L}(B, y)$ if
$B \neq A$ or $y \neq x$, and $\mathcal{L}[(A, x) \mapsto X](A, x) = X$ otherwise. All of the rules in
Figure 1 just ignore this extra parameter of relation $\stackrel{\mathrm{PEG}}{\leadsto}$. We also have rules for
the new relation $\stackrel{\mathrm{INC}}{\leadsto}$, responsible for the iterative process of finding the correct
bound for a given left-recursive use of a non-terminal.

Rules lvar.1 and lvar.2 apply the first time a left-recursive non-terminal is
used with a given subject, and they try to match $A^1$ by trying to match the
production of $A$ using fail for any left-recursive use of $A$ (those uses will fail
through rule lvar.3). If $A^1$ fails we do not try bigger bounds (rule lvar.2), but
if $A^1$ succeeds we store the result in $\mathcal{L}$ and try to find a bigger bound (rule
lvar.1). Rule lvar.4 is used for left-recursive invocations of $A^n$ in the process of
matching $A^{n+1}$.

Relation $\stackrel{\mathrm{INC}}{\leadsto}$ tries to find the bound where $A$ matches the longest prefix.
Which rule applies depends on whether matching the production of $A$ using the
memoized value for the current bound leads to a longer match or not; rule inc.1
covers the first case, where we use relation $\stackrel{\mathrm{INC}}{\leadsto}$ again to continue increasing the
bound after updating $\mathcal{L}$. Rules inc.2 and inc.3 cover the second case, where the
current bound is the correct one and we just return its result.

Let us walk through an example, again using E ← E + n / n as our PEG,
with n+n+n as the subject. When first matching E against n+n+n we have
$(E, n{+}n{+}n) \notin \mathcal{L}$, as $\mathcal{L}$ is initially empty, so we have to match E + n / n against
n+n+n with $\mathcal{L} = \{(E, n{+}n{+}n) \mapsto \mathtt{fail}\}$. We now have to match E + n against
n+n+n, which means matching E again, but now we use rule lvar.3. The first
alternative, E + n, fails, and we have
$G[E + n \,/\, n]\; n{+}n{+}n\; \{(E, n{+}n{+}n) \mapsto \mathtt{fail}\} \stackrel{\mathrm{PEG}}{\leadsto} ({+}n{+}n, n)$
using the second alternative, n, and rule ord.3.

In order to finish rule lvar.1 and the initial match we have to try to increase
the bound through relation $\stackrel{\mathrm{INC}}{\leadsto}$ with $\mathcal{L} = \{(E, n{+}n{+}n) \mapsto ({+}n{+}n, n)\}$. This
means we must try to match E + n / n against n+n+n again, using the new $\mathcal{L}$.
When we try the first alternative and match E with n+n+n the result will be
$({+}n{+}n, E[n])$ via lvar.4, and we can then use con.1 to match E + n yielding
$({+}n, E[n]{+}n)$. We have successfully increased the bound, and are in rule inc.1,
with $x = n$, $y = {+}n$, and $zw = {+}n$.

In order to finish rule inc.1 we have to try to increase the bound again using
relation $\stackrel{\mathrm{INC}}{\leadsto}$, now with $\mathcal{L} = \{(E, n{+}n{+}n) \mapsto ({+}n, E[n]{+}n)\}$. We try to match
$P(E)$ again with this new $\mathcal{L}$, and this yields $(\varepsilon, E[E[n]{+}n]{+}n)$ via lvar.4, con.1,
and ord.1. We have successfully increased the bound and are using rule inc.1
again, with $x = n{+}n$, $y = {+}n$, and $zw = \varepsilon$.

We are in rule inc.1, and have to try to increase the bound a third time,
with $\mathcal{L} = \{(E, n{+}n{+}n) \mapsto (\varepsilon, E[E[n]{+}n]{+}n)\}$. We have to match E + n / n
against n+n+n again, using this $\mathcal{L}$. In the first alternative E matches and yields
$(\varepsilon, E[E[E[n]{+}n]{+}n])$ via lvar.4, but the first alternative itself fails via con.2. We
then have to match E + n / n against n+n+n using ord.3, yielding $({+}n{+}n, n)$.
The attempt to increase the bound for the third time failed (we are back to the
same result we had when $\mathcal{L} = \{(E, n{+}n{+}n) \mapsto \mathtt{fail}\}$), and we use rule inc.3
once and rule inc.1 twice to propagate $(\varepsilon, E[E[n]{+}n]{+}n)$ back to rule lvar.1, and
use this rule to get the final result,
$G[E]\; n{+}n{+}n\; \{\} \stackrel{\mathrm{PEG}}{\leadsto} (\varepsilon, E[E[E[n]{+}n]{+}n])$.

We can see that the parse string E[E[E[n]+n]+n] implies left-associativity
in the + operations, as intended by the use of a left-recursive rule.
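
The walk-through above can be replayed mechanically. The following Python sketch is one possible reading of the rules in Figure 2, not the authors' implementation; the tagged-tuple encoding and the names `match`, `increase` and `FAIL` are assumptions, only terminals, concatenation, choice and non-terminals are covered, and the memoization table is a plain dictionary keyed by (non-terminal, subject) that is copied instead of scoped.

```python
# Hypothetical encoding (an assumption, not from the paper): tagged tuples.
#   ("chr", a)  ("var", A)  ("seq", p1, p2)  ("alt", p1, p2)
FAIL = None  # stands for the result `fail`


def match(G, p, s, L):
    """Return (unconsumed suffix, parse string) or FAIL, with memo table L."""
    tag = p[0]
    if tag == "chr":
        return (s[1:], p[1]) if s[:1] == p[1] else FAIL
    if tag == "seq":
        r1 = match(G, p[1], s, L)
        if r1 is FAIL:
            return FAIL
        r2 = match(G, p[2], r1[0], L)
        return FAIL if r2 is FAIL else (r2[0], r1[1] + r2[1])
    if tag == "alt":
        r1 = match(G, p[1], s, L)
        return r1 if r1 is not FAIL else match(G, p[2], s, L)
    A = p[1]                                          # tag == "var"
    if (A, s) in L:                                   # lvar.3 / lvar.4
        r = L[(A, s)]
        return FAIL if r is FAIL else (r[0], f"{A}[{r[1]}]")
    r = match(G, G[A], s, {**L, (A, s): FAIL})        # try A^1 first
    if r is FAIL:
        return FAIL                                   # lvar.2
    r = increase(G, A, s, L, r)                       # lvar.1
    return r[0], f"{A}[{r[1]}]"


def increase(G, A, s, L, cur):
    """Raise the bound of A while the matched prefix keeps growing."""
    while True:
        nxt = match(G, G[A], s, {**L, (A, s): cur})
        if nxt is FAIL or len(nxt[0]) >= len(cur[0]):  # inc.2 / inc.3
            return cur
        cur = nxt                                      # inc.1


def seq(*ps):
    """Right-nested concatenation of several expressions."""
    return ps[0] if len(ps) == 1 else ("seq", ps[0], seq(*ps[1:]))


# E <- E + n / n
G = {"E": ("alt", seq(("var", "E"), ("chr", "+"), ("chr", "n")), ("chr", "n"))}
print(match(G, ("var", "E"), "n+n+n", {}))
```

Copying the table at every non-terminal keeps the sketch short but makes it exponential in the worst case; a practical implementation would combine the table with packrat-style memoization or a parsing machine.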

More complex grammars, that encode different precedences and associativities,
behave as expected. For example, the following grammar has a right-associative
+ with a left-associative -:

E ← M + E / M
M ← M - n / n

Matching E with n+n+n yields E[M[n]+E[M[n]+E[M[n]]]], as matching M
against n+n+n, n+n, and n all consume just the first n while generating M[n],
because $G[M - n \,/\, n]\; n{+}n{+}n\; \{(M, n{+}n{+}n) \mapsto \mathtt{fail}\} \stackrel{\mathrm{PEG}}{\leadsto} ({+}n{+}n, n)$ via
lvar.3, con.3, and ord.3, and
$G[M - n \,/\, n]\; n{+}n{+}n\; \{(M, n{+}n{+}n) \mapsto ({+}n{+}n, n)\} \stackrel{\mathrm{INC}}{\leadsto} ({+}n{+}n, n)$ via inc.3.
The same holds for subjects n+n and n with different suffixes. Now, when E
matches n+n+n we will have M in M + E matching the first n, while E recursively
matches the remaining n+n, with M again matching the first n and E recursively
matching the last n via the second alternative.

Matching E with n-n-n will yield E[M[M[M[n]-n]-n]], as M now matches
n-n-n with a proof tree similar to our first example (E ← E + n / n against
n+n+n). The first alternative of E fails because M consumed the whole subject,
and the second alternative yields the final result via ord.3 and var.1.

The semantics of Figure 2 also handles indirect and mutual left recursion
well. The following mutually left-recursive PEG is a direct translation of the
CFG used as the last example of Section 2:

L ← P.x / x
P ← P(n) / L

It is instructive to work out what happens when matching L with a subject
such as x(n)(n).x(n).x. We will use our superscript notation for bounded
recursion, but it is easy to check that the explanation corresponds exactly with
what is happening with the semantics using $\mathcal{L}$.

The first alternative of $L^1$ will fail because both alternatives of $P^1$ fail, as
they use $P^0$, due to the direct left recursion on P, and $L^0$, due to the indirect left
recursion on L. The second alternative of $L^1$ matches the first x of the subject.
Now $L^2$ will try to match $P^1$ again, and the first alternative of $P^1$ fails because
it uses $P^0$, while the second alternative uses $L^1$ and matches the first x, and so
$P^1$ now matches x, and we have to try $P^2$, which will match x(n) through the
first alternative, now using $P^1$. $P^3$ uses $P^2$ and matches x(n)(n) with the first
alternative, but $P^4$ matches just x again, so $P^3$ is the answer, and $L^2$ matches
x(n)(n).x via its first alternative.

$L^3$ will try to match $P^1$ again, but $P^1$ now matches x(n)(n).x via its
second alternative, as it uses $L^2$. This means $P^2$ will match x(n)(n).x(n), while
$P^3$ will match x(n)(n).x again, so $P^2$ is the correct bound, and $L^3$ matches
x(n)(n).x(n).x, the entire subject. It is easy to see that $L^4$ will match just x
again, as $P^1$ will now match the whole subject using $L^3$, and the first alternative
of $L^4$ will fail.

Intuitively, the mutual recursion behaves like nested repetitions, with the
inner repetition consuming (n) and the outer repetition consuming the result
of the inner repetition plus .x. The result is a PEG equivalent to the PEG for
l-values at the end of Section 2 in the subjects it matches, but that yields parse
strings that are correctly left-associative on each (n) and .x.

We presented the new rules as extensions intended only for non-terminals
with left-recursive rules, but this is not necessary: the lvar rules can replace
var without changing the result of any proof tree. If a non-terminal does not
appear in a left-recursive position then rules lvar.3 and lvar.4 can never apply
by definition. These rules are the only place in the semantics where the
contents of $\mathcal{L}$ affect the result, so lvar.2 is equivalent to var.2 in the absence of
left recursion. Analogously, if $G[P(A)]\; xy\; \mathcal{L}[(A, xy) \mapsto \mathtt{fail}] \stackrel{\mathrm{PEG}}{\leadsto} (y, x')$ then
$G[P(A)]\; xy\; \mathcal{L}[(A, xy) \mapsto (y, x')] \stackrel{\mathrm{PEG}}{\leadsto} (y, x')$ in the absence of left recursion,
so we will always have $G[A]\; xy\; \mathcal{L}[(A, xy) \mapsto (y, x')] \stackrel{\mathrm{INC}}{\leadsto} (y, x')$ via inc.3, and
lvar.1 is equivalent to var.1. We can formalize this argument with the following
lemma:

Lemma 1 (Conservativeness). Given a PEG $G$, a parsing expression $p$ and
a subject $xy$, if $G[p]\; xy \stackrel{\mathrm{PEG}}{\leadsto} X$, where $X$ is fail or $(y, x')$, then
$G[p]\; xy\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} X$, as long as $(A, w) \notin \mathcal{L}$ for any non-terminal $A$
and subject $w$ appearing as $G[A]\; w$ in the proof tree of $G[p]\; xy \stackrel{\mathrm{PEG}}{\leadsto} X$.

Proof. By induction on the height of the proof tree for $G[p]\; xy \stackrel{\mathrm{PEG}}{\leadsto} X$. Most
cases are trivial, as the extension of their rules with $\mathcal{L}$ does not change the table.
The interesting cases are var.1 and var.2.

For case var.2 we need to use rule lvar.2. We introduce $(A, xy) \mapsto \mathtt{fail}$ in
$\mathcal{L}$, but $G[A]\; xy$ cannot appear in any part of the proof tree of
$G[P(A)]\; xy \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$, so we can just use the induction hypothesis.

For case var.1 we need to use rule lvar.1. Again we have $(A, xy) \mapsto \mathtt{fail}$ in
$\mathcal{L}$, but we can use the induction hypothesis on $G[P(A)]\; xy\; \mathcal{L}[(A, xy) \mapsto \mathtt{fail}]$ to
get $(y, x')$. We also use inc.3 to get
$G[P(A)]\; xy\; \mathcal{L}[(A, xy) \mapsto (y, x')] \stackrel{\mathrm{INC}}{\leadsto} (y, x')$
from $G[P(A)]\; xy\; \mathcal{L}[(A, xy) \mapsto (y, x')]$, using the induction hypothesis, finishing
lvar.1.

A non-obvious consequence of our bounded left recursion semantics is that a
rule that mixes left and right recursion is right-associative. For example,
matching E ← E + E / n against n+n+n yields the parse string
E[E[n]+E[E[n]+E[n]]]. The reason is that $E^2$ already matches the whole string:

$$\begin{aligned}
E^1 &\leftarrow E^0 + E \,/\, n = n \\
E^2 &\leftarrow E^1 + E \,/\, n = n + E \,/\, n
\end{aligned}$$



We have the first alternative of $E^2$ matching n+ and then trying to match
E with n+n. Again we will have $E^2$ matching the whole string, with the first
alternative matching n+ and then matching E with n via $E^1$. In practice this
behavior is not a problem, as similar constructions are also problematic in parsing
CFGs, and grammar writers are aware of them.

An implementation of our semantics can use ad-hoc extensions to control
associativity in this kind of PEG, by having a right-recursive use of non-terminal
$A$ with a pending left-recursive use match through $A^1$ directly instead of going
through the regular process. Similar extensions can be used to have different
associativities and precedences in operator grammars such as
E ← E + E / E - E / E * E / (E) / n.

In order to prove that our semantics for PEGs with left recursion gives meaning
to any closed PEG (that is, any PEG $G$ where $P(A)$ is defined for all
non-terminals in $G$) we have to fix the case where a repetition may not terminate
($p$ in $p^*$ has not failed but not consumed any input). We can add a $x \neq \varepsilon$
predicate to rule rep.2 and then add a new rule:

$$\frac{G[p]\; x\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} (x, \varepsilon)}{G[p^*]\; x\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} (x, \varepsilon)}$$

We also need a well-founded ordering $<$ among the elements of the left side
of relation $\stackrel{\mathrm{PEG}}{\leadsto}$. For the subject we can use $x < y$ if and only if $x$ is a proper
suffix of $y$ as the order, for the parsing expression we can use $p_1 < p_2$ if and only
if $p_1$ is a proper part of the structure of $p_2$, and for $\mathcal{L}$ we can use
$\mathcal{L}[A \mapsto (x, y)] < \mathcal{L}$ if and only if either $\mathcal{L}(A)$ is not defined or $x < z$, where
$\mathcal{L}(A) = (z, w)$. Now we can prove the following lemma:

Lemma 2 (Completeness). Given a closed PEG $G$, a parsing expression $p$,
a subject $xy$, and a memoization table $\mathcal{L}$, we have either
$G[p]\; xy\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} (y, x')$ or $G[p]\; xy\; \mathcal{L} \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}$.

Proof. By induction on the triple $(\mathcal{L}, xy, p)$. It is straightforward to check that
we can always use the induction hypothesis on the antecedent of the rules of our
semantics.



4 Related Work 

Warth et al. [33] describe a modification of the packrat parsing algorithm to
support both direct and indirect left recursion. The algorithm uses the packrat 
memoization table to detect left recursion, and then begins an iterative process 
that is similar to the process of finding the correct bound in our semantics. 

Warth et al.'s algorithm is tightly coupled to the packrat parsing approach, 
and its full version, with support for indirect left recursion, is complex, as noted 
by the authors [32]. The release versions of the authors' PEG parsing library, 
OMeta [24], only implement support for direct left recursion to avoid the extra 
complexity [22]. 



The algorithm also produces surprising results with some grammars, both 
directly and indirectly left-recursive, due to the way it tries to reuse the packrat 
memoization table [T]. Our semantics does not share these issues, although it 
shows that a left-recursive packrat parser cannot index the packrat memoization 
table just by a parsing expression and a subject, as the $\mathcal{L}$ table is also involved.
One solution to this issue is to have a scoped packrat memoization table, with
a new entry to $\mathcal{L}$ introducing a new scope. We believe this solution is simpler to
implement in a packrat parser than Warth et al.'s. 

Tratt [3T] presents an algorithm for supporting direct left recursion in PEGs, 
based on Warth et al.'s, that does not use a packrat memoization table and 
does not assume a packrat parser. The algorithm is simple, although Tratt also 
presents a more complex algorithm that tries to "fix" the right-recursive bias in 
productions that have both left and right recursion, like the E ← E + E / n
example we discussed at the end of Section 3. We do not believe this bias is a 
problem, although it can be fixed in our semantics with ad-hoc methods. 

IronMeta [20] is a PEG library for the Microsoft Common Language Runtime, 
based on OMeta [23], that supports direct and indirect left recursion using an 
implementation of an unpublished preliminary version of our semantics. This 
preliminary version is essentially the same, apart from notational details, so 
IronMeta can be considered a working implementation of our semantics. Initial 
versions of IronMeta used Warth et al.'s algorithm for left recursion [23], but in 
version 2.0 the author switched to an implementation of our semantics, which 
he considered "much simpler and more general" [20].

Parser combinators [F are a top-down parsing method that is similar to 
PEGs, being another way to declaratively specify a recursive descent parser for 
a language, and share with PEGs the same issues of non-termination in the 
presence of left recursion. Frost et al. [4] describe an approach for supporting
left recursion in parser combinators where a count of the number of left-recursive 
uses of a non-terminal is kept, and the non-terminal fails if the count exceeds 
the number of tokens of the input. We have shown in Section 3 that such an 
approach would not work with PEGs, because of the semantics of ordered choice 
(parser combinators use the same non-deterministic choice operator as CFGs). 
Ridge [19] presents another way of implementing the same approach for handling
left recursion, and has the same issues regarding its application to PEGs. 
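
The counting idea described above can be illustrated with a minimal Python sketch. It is not the implementation of Frost et al.: the tuple encoding of expressions, the names `parse` and `CFG`, and the choice to track one counter per (non-terminal, position) pair are all assumptions made for illustration. Choice is non-deterministic, as in CFGs, so the recognizer returns the set of all end positions.

```python
# Hypothetical encoding of grammar expressions as tuples:
#   ("tok", t), ("seq", p1, p2), ("alt", p1, p2), ("var", name)

def parse(grammar, p, tokens, pos, counts):
    """Return the set of all positions where a match of p starting at pos can end."""
    kind = p[0]
    if kind == "tok":
        ok = pos < len(tokens) and tokens[pos] == p[1]
        return {pos + 1} if ok else set()
    if kind == "seq":
        return {end
                for mid in parse(grammar, p[1], tokens, pos, counts)
                for end in parse(grammar, p[2], tokens, mid, counts)}
    if kind == "alt":
        # Non-deterministic choice: keep the results of both alternatives.
        return (parse(grammar, p[1], tokens, pos, counts)
                | parse(grammar, p[2], tokens, pos, counts))
    if kind == "var":
        key = (p[1], pos)
        depth = counts.get(key, 0) + 1
        # Fail once the nested uses at this position exceed the input length.
        if depth > len(tokens):
            return set()
        return parse(grammar, grammar[p[1]], tokens, pos, {**counts, key: depth})
    raise ValueError(f"unknown expression kind: {kind!r}")


# E -> E '+' 'n' | 'n'
CFG = {
    "E": ("alt",
          ("seq", ("var", "E"), ("seq", ("tok", "+"), ("tok", "n"))),
          ("tok", "n")),
}

print(parse(CFG, ("var", "E"), list("n+n+n"), 0, {}))
```

Because both alternatives always contribute their results, cutting off the recursion at a fixed depth only removes derivations that could not succeed anyway. The sketch does not attempt to reproduce the interaction with ordered choice discussed in Section 3.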

ANTLR [TB] is a popular parser generator that produces top-down parsers for 
Context-Free Grammars based on LL(*), an extension of LL(k) parsing. Version
4 of ANTLR will have support for direct left recursion that is specialized for
expression parsers [15], handling precedence and associativity by rewriting the
grammar to encode a precedence climbing parser [7]. This support is heavily
dependent on ANTLR extensions such as semantic predicates and backtracking. 
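
For readers unfamiliar with precedence climbing, the following Python sketch shows the general technique on a small arithmetic language. It is a generic hand-written parser, not ANTLR's grammar rewriting; the operator table `BINARY` and the function names are assumptions chosen for illustration.

```python
import re

# Hypothetical operator table: operator -> (precedence, associativity).
BINARY = {
    "+": (1, "left"), "-": (1, "left"),
    "*": (2, "left"), "/": (2, "left"),
    "^": (3, "right"),
}

def tokenize(src):
    """Split the source into integer literals, operators, and parentheses."""
    return re.findall(r"\d+|[-+*/^()]", src)

def parse_atom(tokens, pos):
    """Parse an integer or a parenthesized expression; return (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_expr(tokens, pos + 1, 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")

def parse_expr(tokens, pos=0, min_prec=1):
    """Parse operators of precedence >= min_prec; return (tree, next position)."""
    lhs, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in BINARY:
        op = tokens[pos]
        prec, assoc = BINARY[op]
        if prec < min_prec:
            break
        # A left-associative operator requires a strictly tighter right operand.
        next_min = prec + 1 if assoc == "left" else prec
        rhs, pos = parse_expr(tokens, pos + 1, next_min)
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = parse_expr(tokenize("1-2-3"))
print(tree)
```

The loop plays the role that a left-recursive rule plays in a grammar: each iteration folds the tree built so far into the left operand of the next operator, which yields left-associative trees without any left-recursive call.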

5 Conclusion 

We presented a conservative extension to the semantics of PEGs that gives a
useful meaning to PEGs with left-recursive rules. It is the first extension that
is not based on packrat parsing as the parsing approach, while supporting both
direct and indirect left recursion. The extension is based on bounded left recur- 
sion, where we limit the number of left-recursive uses a non-terminal may have, 
guaranteeing termination, and we use an iterative process to find the smallest 
bound that gives the longest match for a particular use of the non-terminal. 
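
The iterative search for the bound can be illustrated with a minimal Python sketch of a PEG recognizer. This is an informal approximation under stated assumptions, not the formal semantics of Section 3: the tuple encoding, the names `match` and `GRAMMAR`, and the use of a copied dictionary as a scoped table keyed by (non-terminal, position) are hypothetical choices made for illustration.

```python
# Hypothetical encoding of parsing expressions as tuples:
#   ("lit", s), ("seq", p1, p2), ("alt", p1, p2), ("var", name)

def match(grammar, p, text, pos, memo):
    """Return the end position of the match of p at pos, or None on failure."""
    kind = p[0]
    if kind == "lit":
        s = p[1]
        return pos + len(s) if text.startswith(s, pos) else None
    if kind == "seq":
        mid = match(grammar, p[1], text, pos, memo)
        return None if mid is None else match(grammar, p[2], text, mid, memo)
    if kind == "alt":
        # Ordered choice: try the second alternative only if the first fails.
        end = match(grammar, p[1], text, pos, memo)
        return end if end is not None else match(grammar, p[2], text, pos, memo)
    if kind == "var":
        key = (p[1], pos)
        if key in memo:
            # Left-recursive use: reuse the result obtained with the previous bound.
            return memo[key]
        memo = dict(memo)   # new scope for this use of the non-terminal
        memo[key] = None    # with bound 0, the left-recursive use fails
        best = None
        while True:
            end = match(grammar, grammar[p[1]], text, pos, memo)
            # Stop as soon as a larger bound no longer gives a longer match.
            if end is None or (best is not None and end <= best):
                return best
            best = end
            memo[key] = end
    raise ValueError(f"unknown expression kind: {kind!r}")


# E <- E '+' 'n' / 'n'
GRAMMAR = {
    "E": ("alt",
          ("seq", ("var", "E"), ("seq", ("lit", "+"), ("lit", "n"))),
          ("lit", "n")),
}

print(match(GRAMMAR, ("var", "E"), "n+n+n", 0, {}))
```

Each pass of the loop corresponds to raising the bound by one. Termination follows because the input is finite and the loop continues only while the match grows strictly longer.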

We also presented some examples that show how grammar writers can use our 
extension to express in PEGs common left-recursive idioms from Context-Free 
Grammars, such as using left recursion for left-associative repetition in expres- 
sion grammars, and the use of mutual left recursion for nested left-associative 
repetition. We augmented the semantics with parse strings to show that
left-recursive PEGs yield a structure similar to the parse trees of
left-recursive CFGs.

Finally, we have proved the conservativeness of our extension, and also proved 
that all PEGs are complete with the extension, so termination is guaranteed for 
the parsing of any subject with any PEG, removing the need for any static 
checks of well-formedness beyond the simple check that every non-terminal in 
the grammar has a rule. 

Our semantics has already been implemented in a PEG library that uses 
packrat parsing [2^. We are now working on adapting the semantics to a PEG 
parsing machine [12], as the first step towards an alternative implementation
based on LPEG [9]. This implementation will incorporate ad-hoc extensions
for controlling precedence and associativity in grammars mixing left and right 
recursion in the same rule, leading to more concise grammars. 